Hi @Josiah_Ritchie,
Sounds like what you’re looking for is some kind of logging / metric aggregation (either Logstash + Kibana, or Splunk for logging aggregation and visualisation, or Grafana for metrics visualisation) and alerting based on that data.
You could, for example, run something like statsd or collectd on each mail server instance with a plugin that captures the metrics you’re looking for, and posts to Graphite (or implement Prometheus and make it a pull model). You could even implement the checks (i.e. metrics polls) in Nagios and use Nagios (or Sensu, et al) to generate the alerting. Grafana (over Graphite or some other metrics backend) can also alert you.
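To make the push model a bit more concrete, here is a minimal sketch of emitting a gauge in the StatsD line format over UDP. The metric name `mail.inbound.queue_depth` and the localhost target are assumptions for illustration only; a real setup would more likely use an existing StatsD client library or a collectd plugin rather than hand-rolled sockets.

```python
import socket


def format_gauge(name: str, value: float) -> str:
    """Render one StatsD gauge line: <name>:<value>|g"""
    return f"{name}:{value}|g"


def send_gauge(name: str, value: float,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one gauge to a StatsD daemon over UDP.

    Host and port are assumptions; 8125 is the usual StatsD default.
    """
    line = format_gauge(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# Hypothetical usage on a mail server, e.g. from cron:
# send_gauge("mail.inbound.queue_depth", 42)
```

StatsD would then flush the aggregated values to Graphite (or whichever backend it is configured for), which is where Grafana picks them up.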
This is how we do it (or a combination of all of the above) - preferably at a global service level (i.e. mail inbound), but sometimes also on individual instances (which is more noisy).
If you’re looking for something that can aggregate and visualise your metrics but don’t want to have to set it up / host it yourself, look to some of the -aaS providers like Datadog (you can also use them as a PagerDuty alert trigger source).
In the end it’s all about how transparent you can make your own metrics - I like to think of exposing, aggregating and displaying metrics as your monitoring (Datadog, Grafana, Prometheus, etc.); only once thresholds or trend analysis are applied to those metrics do they become alerting (PagerDuty).
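As a rough illustration of that split, the sketch below applies a simple threshold to a metric value and, only when it is breached, builds a trigger event in the shape the PagerDuty Events API v2 expects. The metric name, threshold, source host and routing key are all hypothetical, and in practice the tool doing the threshold check (Nagios, Grafana, Datadog, etc.) would normally send the event through its own PagerDuty integration rather than custom code like this.

```python
from typing import Optional


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' body (not sent anywhere here)."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty Service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,     # critical, error, warning or info
        },
    }


def check_threshold(metric: str, value: float, threshold: float,
                    routing_key: str, source: str) -> Optional[dict]:
    """Return an alert event only if the metric exceeds its threshold."""
    if value <= threshold:
        return None  # still just monitoring; nothing to page about
    summary = f"{metric} is {value} (threshold {threshold})"
    return build_trigger_event(routing_key, summary, source)


# Hypothetical usage; the resulting dict would be POSTed as JSON to
# https://events.pagerduty.com/v2/enqueue
# event = check_threshold("mail.inbound.queue_depth", 900, 500,
#                         "YOUR_ROUTING_KEY", "mx1.example.com")
```

Everything up to the `if` is monitoring; the event only exists once the threshold has been applied.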
If your PagerDuty Services are specific enough, you can use Extensions -> Add-ons to include your dashboards inline in Services (but it’ll be bound to a Service, and visible in Incidents belonging to that Service, and you can’t specify inline display links dynamically with the alert - only statically per Service).
Hope that helps,
@simonfiddaman